perf(l1): parallelize header download with state download during snap sync by pablodeymo · Pull Request #6059 · lambdaclass/ethrex

pablodeymo · 2026-01-28T21:36:54Z

Summary

Moves header downloading to a background task that runs in parallel with state download
State download can now begin immediately instead of waiting for all headers to be downloaded
Adds incremental header processing at strategic points during snap_sync()
Fixes block_hashes ordering bug by replacing Vec<H256> with BTreeMap<u64, H256>

Motivation

Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays since:

State download is independent of header download
Pivot selection only needs recent headers, not all of them
For mainnet with millions of blocks, header download alone takes significant time

Additionally, the block_hashes: Vec<H256> field stored hashes without block numbers. The numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When the background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes.

Implementation

Add header_receiver and download_complete fields to SnapBlockSyncState
Create download_headers_background() function that runs header download in a separate tokio task
Modify sync_cycle_snap() to spawn the background task and start state download immediately
Add process_pending_headers() helper to incrementally process headers at strategic points
snap_sync() waits for initial headers only when needed for pivot selection
Replace block_hashes: Vec<H256> with BTreeMap<u64, H256> keyed by block number — this naturally handles ordering and deduplicates overlapping ranges from update_pivot re-inserting the same block numbers
Build numbers_and_hashes directly from BTreeMap keys instead of inferring from position

Test plan

Run snap sync on testnet and verify parallel execution via logs
Verify state download starts before header download completes
Check memory usage to ensure channel doesn't buffer too many headers
Compare sync time before/after the change
Verify forkchoice_update writes correct canonical hashes after snap sync

…tory Reorganize state_healing.rs and storage_healing.rs into a shared sync/healing/ module structure with clearer naming conventions: - Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs - Rename MembatchEntryValue to HealingQueueEntry - Rename MembatchEntry to StorageHealingQueueEntry - Rename Membatch type to StorageHealingQueue - Rename children_not_in_storage_count to missing_children_count - Rename membatch variables to healing_queue throughout - Extract shared HealingQueueEntry and StateHealingQueue types to types.rs - Update sync.rs imports to use new healing module

Reorganize snap protocol code for better maintainability: - Split rlpx/snap.rs into rlpx/snap/ directory: - codec.rs: RLP encoding/decoding for snap messages - messages.rs: Snap protocol message types - mod.rs: Module re-exports - Split snap.rs into snap/ directory: - constants.rs: Snap sync constants and configuration - server.rs: Snap protocol server implementation - mod.rs: Module re-exports - Move snap server tests to dedicated tests/ directory - Update imports in p2p.rs, peer_handler.rs, and code_collector.rs

Document the phased approach for reorganizing snap sync code: - Phase 1: rlpx/snap module split - Phase 2: snap module split with server extraction - Phase 3: healing module unification

Split the large sync.rs (1631 lines) into focused modules: - sync/full.rs (~260 lines): Full sync implementation - sync_cycle_full(), add_blocks_in_batch(), add_blocks() - sync/snap_sync.rs (~1100 lines): Snap sync implementation - sync_cycle_snap(), snap_sync(), SnapBlockSyncState - store_block_bodies(), update_pivot(), block_is_stale() - validate_state_root(), validate_storage_root(), validate_bytecodes() - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb) - sync.rs (~285 lines): Orchestration layer - Syncer struct with start_sync() and sync_cycle() - SyncMode, SyncError, AccountStorageRoots types - Re-exports for public API

…p/client.rs Move all snap protocol client-side request methods from peer_handler.rs to a dedicated snap/client.rs module: - request_account_range and request_account_range_worker - request_bytecodes - request_storage_ranges and request_storage_ranges_worker - request_state_trienodes - request_storage_trienodes Also moves related types: DumpError, RequestMetadata, SnapClientError, RequestStateTrieNodesError, RequestStorageTrieNodes. This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction), leaving it focused on ETH protocol methods (block headers/bodies). Added SnapClientError variant to SyncError for proper error handling. Updated plan_snap_sync.md to mark Phase 4 as complete.

…napError type Implement Phase 5 of snap sync refactoring plan - Error Handling. - Create snap/error.rs with unified SnapError enum covering all snap protocol errors - Update server functions (process_account_range_request, process_storage_ranges_request, process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError> - Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError - Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing - Add From<SnapError> for PeerConnectionError to support error propagation in message handlers - Update sync module to use SyncError::Snap variant - Update healing modules (state.rs, storage.rs) to use new error types - Move DumpError struct to error.rs module - Update test return types to use SnapError - Mark Phase 5 as completed in plan document All phases of the snap sync refactoring are now complete.

Change missing_children_count from u64 to usize in HealingQueueEntry and node_missing_children function to match StorageHealingQueueEntry and be consistent with memory structure counting conventions.

Resolve conflicts from #5977 and #6018 merge to main: - Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs) - Keep snap client code in snap/client.rs (removed from peer_handler.rs) - Add InsertingAccountRanges metric from #6018 to snap_sync.rs - Remove unused info import from peer_handler.rs

Previously, sync_cycle_snap() downloaded ALL block headers before starting state download. This caused unnecessary delays since state download is independent of header download and the pivot selection only needs recent headers. This change moves header downloading to a background task that runs in parallel with state download: - Add header_receiver and download_complete fields to SnapBlockSyncState - Create download_headers_background() function that runs header download in a separate tokio task and sends headers through an mpsc channel - Modify sync_cycle_snap() to spawn the background task and start state download immediately - Add process_pending_headers() helper to incrementally process headers at strategic points during snap_sync() - snap_sync() waits for initial headers only when needed for pivot selection This allows state download to begin much sooner, potentially reducing overall sync time significantly for mainnet syncs.

github-actions · 2026-01-28T21:39:14Z

Lines of code report

Total lines added: 158
Total lines removed: 0
Total lines changed: 158

Detailed view

+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1295  | +157 |
+------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                 | 2648  | +1   |
+------------------------------------------------+-------+------+

greptile-apps · 2026-01-28T21:40:48Z

Greptile Summary

This PR parallelizes block header download with state download during snap sync, so state acquisition no longer stalls behind full-chain header retrieval. It also fixes a canonical-ordering bug by replacing the positional Vec<H256> with a BTreeMap<u64, H256> keyed by block number.

download_headers_background runs as a Tokio task, persisting each batch to the store and streaming it to the main task via a bounded channel; a CompletionGuard ensures the completion flag is always flipped on any exit path.
Abort checks (should_abort_snap_sync) are scattered through snap_sync so the main task bails out quickly when the background task detects a switch to full sync.
forkchoice_update is called inside snap_sync after a non-blocking process_pending_headers drain, but there is no barrier that waits for the background task to finish; headers downloaded after that drain are persisted to the store but never reflected in the canonical mapping.

Confidence Score: 4/5

Safe to merge on testnets where header download completes quickly; on long chains the canonical chain for pre-pivot historical blocks can end up incomplete when state download outpaces header download.

The RAII CompletionGuard, BTreeMap ordering fix, abort hooks, and full-sync hand-off are all well-constructed. The one concrete defect is that forkchoice_update is called inside snap_sync before the background header task is drained — the process_pending_headers call just before it is non-blocking, so headers arriving slightly later are persisted to the store but never given canonical block-number mappings. This is the exact scenario the parallelisation optimises for, making it a real path on long chains.

crates/networking/p2p/sync/snap_sync.rs — specifically the forkchoice_update call site and the missing drain-until-terminal loop before it.

Important Files Changed

Filename	Overview
crates/networking/p2p/sync/snap_sync.rs	Core snap sync rewrite: adds background header download via channel with RAII CompletionGuard, BTreeMap for ordered block_hashes, and abort hooks throughout — but forkchoice_update is called before the background task finishes, which can leave canonical block mappings incomplete when state download outpaces header download.
crates/networking/p2p/sync.rs	Adds probe_sync_head_number pre-check before committing to snap sync, imports MIN_FULL_BLOCKS for the devnet guard, and wires snap_enabled_arc into sync_cycle_snap; logic is straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant SC as sync_cycle_snap
    participant BG as download_headers_background
    participant SS as snap_sync
    participant Store

    SC->>BG: tokio::spawn(download_headers_background)
    SC->>SS: snap_sync(..., snap_enabled_arc)

    par Header download
        loop until sync_head found
            BG->>Store: add_block_headers + set_checkpoint
            BG-->>SS: header_sender.send(batch)
        end
        BG->>BG: "CompletionGuard drop → download_complete=true"
        Note over BG: if head_found: snap_enabled_arc=false
    and State download
        SS->>SS: wait for initial headers
        SS->>SS: request_account_range
        SS->>SS: process_pending_headers (try_recv)
        SS->>SS: request_storage_ranges
        SS->>SS: process_pending_headers (try_recv)
        SS->>SS: heal_state_trie / heal_storage_trie
        SS->>SS: process_pending_headers (try_recv)
        SS->>SS: download bytecodes
        SS->>SS: process_pending_headers (NON-BLOCKING)
        Note over SS: BG may still be running
        SS->>Store: forkchoice_update(partial block_hashes)
    end

    SC->>BG: header_download_handle.await
    Note over SC,BG: headers persisted after forkchoice_update never get canonical mapping
    SC->>Store: clear_snap_state
    SC->>SC: "snap_enabled = false"

Comments Outside Diff (1)

crates/networking/p2p/sync/snap_sync.rs, line 893-910 (link)

forkchoice_update called before background header download completes

When snap state download (account ranges → healing → bytecodes) finishes before the background task has found sync_head, snap_enabled is still true, no abort fires, and forkchoice_update is called here with only the headers received via channel so far. The background task then continues persisting headers to the store — but because forkchoice_update was already called, those blocks never get canonical block-number→hash mappings. After sync_cycle_snap joins the background task, neither a second forkchoice_update nor any re-run fills in the gaps. The result is a permanently incomplete canonical chain for pre-pivot historical blocks.

The scenario is not hypothetical: on mainnet (millions of blocks to download) state download is likely to finish before header download, which is exactly the parallel-execution win this PR targets.

A minimal fix is to drain all remaining headers from the background task before building numbers_and_hashes — loop on recv_headers_timeout until download_status().is_terminal(). Alternatively, move forkchoice_update out of snap_sync into sync_cycle_snap, after the header_download_handle.await join.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 893-910

Comment:
**`forkchoice_update` called before background header download completes**

When snap state download (account ranges → healing → bytecodes) finishes before the background task has found `sync_head`, `snap_enabled` is still `true`, no abort fires, and `forkchoice_update` is called here with only the headers received via channel so far. The background task then continues persisting headers to the store — but because `forkchoice_update` was already called, those blocks never get canonical block-number→hash mappings. After `sync_cycle_snap` joins the background task, neither a second `forkchoice_update` nor any re-run fills in the gaps. The result is a permanently incomplete canonical chain for pre-pivot historical blocks.

The scenario is not hypothetical: on mainnet (millions of blocks to download) state download is likely to finish before header download, which is exactly the parallel-execution win this PR targets.

A minimal fix is to drain all remaining headers from the background task before building `numbers_and_hashes` — loop on `recv_headers_timeout` until `download_status().is_terminal()`. Alternatively, move `forkchoice_update` out of `snap_sync` into `sync_cycle_snap`, after the `header_download_handle.await` join.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/networking/p2p/sync/snap_sync.rs:893-910
**`forkchoice_update` called before background header download completes**

When snap state download (account ranges → healing → bytecodes) finishes before the background task has found `sync_head`, `snap_enabled` is still `true`, no abort fires, and `forkchoice_update` is called here with only the headers received via channel so far. The background task then continues persisting headers to the store — but because `forkchoice_update` was already called, those blocks never get canonical block-number→hash mappings. After `sync_cycle_snap` joins the background task, neither a second `forkchoice_update` nor any re-run fills in the gaps. The result is a permanently incomplete canonical chain for pre-pivot historical blocks.

The scenario is not hypothetical: on mainnet (millions of blocks to download) state download is likely to finish before header download, which is exactly the parallel-execution win this PR targets.

A minimal fix is to drain all remaining headers from the background task before building `numbers_and_hashes` — loop on `recv_headers_timeout` until `download_status().is_terminal()`. Alternatively, move `forkchoice_update` out of `snap_sync` into `sync_cycle_snap`, after the `header_download_handle.await` join.

### Issue 2 of 2
crates/networking/p2p/sync/snap_sync.rs:1083-1089
**`process_incoming_headers` errors on empty response from `request_block_headers`**

If `request_block_headers` returns `Some(vec![])` (an empty list rather than `None`), `ok_or(SyncError::NoBlockHeaders)` succeeds and `process_incoming_headers` is called with an empty iterator. Inside that function, `last_hash.ok_or(SyncError::InvalidRangeReceived)?` then returns an error, causing `update_pivot` to fail with a misleading error. Adding a guard `if block_headers.is_empty() { return Err(SyncError::NoBlockHeaders); }` between the two statements would give a clearer error and prevent an unexpected failure path.

_{Reviews (2): Last reviewed commit: "Merge branch 'main' into feature/backgro..." | Re-trigger Greptile}

greptile-apps

_{1 file reviewed, 6 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-28T21:40:52Z

+    );
+
+    // Create channel for header communication between background task and main snap_sync
+    let (header_sender, header_receiver) = tokio::sync::mpsc::channel(100);


Channel buffer size of 100 could cause backpressure if header download significantly outpaces state download. Consider making buffer size configurable or calculating based on expected header batch sizes

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/sync/snap_sync.rs Line: 285:285 Comment: Channel buffer size of 100 could cause backpressure if header download significantly outpaces state download. Consider making buffer size configurable or calculating based on expected header batch sizes <sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub> How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-01-28T21:40:54Z

+        self.download_complete
+            .as_ref()
+            .map(|flag| flag.load(Ordering::Acquire))
+            .unwrap_or(true)
+    }


Using Ordering::Acquire here is correct for reading the flag. However, consider the semantics: when download_complete is None, returning true means "download is complete", which could be misleading since no download was started. Consider documenting this behavior or returning a different value.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/sync/snap_sync.rs Line: 89:93 Comment: Using `Ordering::Acquire` here is correct for reading the flag. However, consider the semantics: when `download_complete` is `None`, returning `true` means "download is complete", which could be misleading since no download was started. Consider documenting this behavior or returning a different value. <sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub> How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-01-28T21:40:57Z

+    while block_sync_state.block_hashes.is_empty() {
+        if block_sync_state.is_download_complete() {
+            return Err(SyncError::NoBlockHeaders);
+        }
+        // Process any available headers
+        process_pending_headers(block_sync_state).await?;
+        if block_sync_state.block_hashes.is_empty() {
+            debug!("Waiting for initial headers from background task...");
+            tokio::time::sleep(Duration::from_millis(100)).await;
+        }
+    }


This busy-wait loop with 100ms sleeps could be inefficient. Consider using a blocking receive with timeout or a condvar-based approach for the initial header wait

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/sync/snap_sync.rs Line: 373:383 Comment: This busy-wait loop with 100ms sleeps could be inefficient. Consider using a blocking receive with timeout or a condvar-based approach for the initial header wait How can I resolve this? If you propose a fix, please make it concise.

in the multisync monitoring script (docker_monitor.py). The sync completion logs already contain per-phase completion markers (e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00") but this data was not surfaced in the Slack messages or run summaries. This adds a parse_phase_timings() function that reads saved container logs and extracts timing, count, and duration for all 8 snap sync phases: Block Headers, Account Ranges, Account Insertion, Storage Ranges, Storage Insertion, State Healing, Storage Healing, and Bytecodes. The breakdown is appended to both the Slack notification (as a code block per network instance) and the text-based run log (run_history.log and per-run summary.txt). When a phase did not complete (e.g. on a failed run), it is simply omitted from the breakdown.

pablodeymo · 2026-02-05T19:26:53Z

  Last 24h Runs (branch: feature/background-header-download, commit 441634d01)
  ┌─────┬──────────────┬─────────┬─────────┬─────────────┬────────────────────────┐
  │ Run │   Started    │ Network │ Result  │  Sync Time  │    Post-sync Block     │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #94 │ Feb 4, 04:30 │ mainnet │ SUCCESS │ 4h 37m      │ 24382518 (+111 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #95 │ Feb 4, 09:36 │ mainnet │ SUCCESS │ 6h 31m      │ 24384610 (+106 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #96 │ Feb 4, 16:35 │ mainnet │ SUCCESS │ 4h 40m      │ 24386126 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #97 │ Feb 4, 21:43 │ mainnet │ SUCCESS │ 4h 49m      │ 24387704 (+109 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #98 │ Feb 5, 03:00 │ mainnet │ SUCCESS │ 4h 43m      │ 24389248 (+110 blocks) │
  ├─────┼──────────────┼─────────┼─────────┼─────────────┼────────────────────────┤
  │ #99 │ Feb 5, 08:12 │ mainnet │ FAILED  │ 8h+ timeout │ -                      │
  └─────┴──────────────┴─────────┴─────────┴─────────────┴────────────────────────┘

pablodeymo · 2026-02-05T19:28:03Z

⏺ Here's the timeline for Run #99:

Run #99 Timeline (PR #6059 - feature/background-header-download)
Time: 08:19
Event: Sync started (mainnet from block 0)
Elapsed: 0
────────────────────────────────────────
Time: 11:54
Event: Bytecodes download started
Elapsed: 3h 35m
────────────────────────────────────────
Time: 13:07
Event: Bytecodes download finished (last batch: 42,217)
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:07
Event: validate_bytecodes started
Elapsed: 4h 48m
────────────────────────────────────────
Time: 13:44
Event: First sync cycle finished (19,490s = 5h 25m total snap sync)
Elapsed: 5h 25m
────────────────────────────────────────
Time: 13:44-16:18
Event: Catch-up sync cycles (~4 min each, continuously)
Elapsed: 5h 25m - 8h+
────────────────────────────────────────
Time: ~16:18
Event: 8h timeout hit — monitor killed it
Elapsed: 8h 6m
Root Cause

The snap sync itself completed in 5h 25m (within normal range — previous runs were 4h
26m-6h 31m). The problem is what happened after:

After sync completed, the node entered catch-up mode doing full sync cycles every ~4
minutes
It kept doing these cycles for 2.5+ hours without the monitor detecting it as "done"
Meanwhile, the background header download feature was still downloading headers in
parallel (log ends showing headers being fetched from block ~24.3M downward)
The node never transitioned to the "processing new blocks" state that the monitor expects
within 22 minutes

Comparison with Successful Runs

In successful runs #92-#98, after snap sync the node caught up and started processing new
blocks within ~22 minutes. In #99, it was stuck in catch-up sync cycles for 2.5h+. This
could be:

The background header download competing for resources/bandwidth with the catch-up sync,
slowing it down
A larger gap to catch up (the sync took 5h25m vs typical 4h35m, meaning the chain
advanced further during sync)
Network conditions — many WARN about invalid/empty headers from peers throughout the
tail of the log

handling to fix clippy lint error

proof conversion helpers (encodable_to_proof, proof_to_encodable) from server.rs to mod.rs since they are shared utilities used by both client and server.

peer selection, extract magic numbers into named constants (ACCOUNT_RANGE_CHUNK_COUNT, STORAGE_BATCH_SIZE, HASH_MAX), remove unused contents field from DumpError and use derive Debug, rename missing_children to pending_children in healing code, and wrap process_byte_codes_request in spawn_blocking for consistency with other server handlers.

…functions so they no longer depend on the PeerHandler type as a method receiver. The three methods that used self (request_account_range, request_bytecodes, request_storage_ranges) now take peers: &mut PeerHandler as a parameter, and the two already-static methods (request_state_trienodes, request_storage_trienodes) simply move out of the impl block. Callers in snap_sync.rs, healing/state.rs, and healing/storage.rs are updated accordingly.

…ation' into feature/background-header-download

…header-download

Copilot

Pull request overview

This PR aims to improve L1 snap sync performance by downloading block headers concurrently with snap state download, so state sync can start immediately instead of waiting for all headers.

Changes:

Introduces a background header-download task and incremental header processing during snap_sync().
Adds snap protocol server tests for Hive AccountRange vectors.
Minor refactor in PeerHandler::request_block_bodies_inner and updates the changelog.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`crates/networking/p2p/sync/snap_sync.rs`	Runs header download in a background task and incrementally processes headers during snap state sync.
`crates/networking/p2p/tests/snap_server_tests.rs`	Adds Hive-derived `AccountRange` tests and a local in-memory state fixture.
`crates/networking/p2p/peer_handler.rs`	Small refactor to store the request result before pattern matching.
`CHANGELOG.md`	Records the perf change entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-09T14:59:52Z

            return super::full::sync_cycle_full(
-                peers,
+                &mut peers,
                blockchain,
                tokio_util::sync::CancellationToken::new(),
                sync_head,


download_headers_background() triggers full::sync_cycle_full() inside the background header task. That causes full sync to run concurrently with the ongoing snap_sync() in the main task (both touching the store and consuming peers), and it also contradicts the comment that the main task will handle the switch. Instead, the background task should only signal the mode switch (e.g., set the flag + mark download complete) and return, letting sync_cycle_snap() abort snap sync and then run full sync in a single place.

Copilot · 2026-02-09T14:59:53Z

+        // Handle reorg case where sync head is not reachable
+        let current_head = store
+            .get_block_header(current_head_number)?
+            .map(|h| h.hash())
+            .unwrap_or(first_block_hash);


The reorg/stuck detection uses store.get_block_header(current_head_number) to derive current_head, but the background task is no longer persisting downloaded headers to the store. After the first batch, current_head_number will likely be beyond the local canonical head, so this lookup returns None and the check becomes ineffective/misleading. Track the last fetched head hash in a local variable (as the previous code did) or persist headers in the background task before using store-based checks.

Suggested change

// Handle reorg case where sync head is not reachable

let current_head = store

.get_block_header(current_head_number)?

.map(|h| h.hash())

.unwrap_or(first_block_hash);

// Handle reorg case where sync head is not reachable.

// Use the last fetched header as the current head instead of relying on the store,

// since the background task does not persist headers into the store.

let current_head = last_block_hash;

Copilot · 2026-02-09T14:59:53Z

+        // Check if we should abort (full sync triggered)
+        if should_abort_snap_sync(snap_enabled) {
+            info!("Snap sync aborted: switching to full sync");
+            return Ok(());


When should_abort_snap_sync() is true, snap_sync() returns Ok(()), and sync_cycle_snap() continues to clear snap state and finish successfully. This makes the mode-switch path ambiguous: if the background task is changed to only signal (and not run full sync itself), the caller will silently skip full sync. Consider returning a distinct status/error from snap_sync() (e.g., "SwitchToFullSync") and have sync_cycle_snap() perform the full sync (and skip clear_snap_state() if full sync is about to run).

Suggested change

return Ok(());

// Signal the caller via an error so it does not treat this as a

// successful completion and can perform the appropriate mode switch.

return Err(SyncError::NoBlockHeaders);

Copilot · 2026-02-09T14:59:54Z

+    // So I copied the state from a geth execution of the test suite
+
+    // State was trimmed to only the first 100 accounts (as the furthest account used by the tests is account 87)
+    // If the full 408 account state is needed check out previous commits the PR that added this code


Minor grammar nit in this comment: "check out previous commits the PR" is missing a preposition (e.g., "previous commits of the PR").

Suggested change

// If the full 408 account state is needed check out previous commits the PR that added this code

// If the full 408 account state is needed, check out previous commits of the PR that added this code

…nc full sync loop After snap sync completes, the node switches to full sync mode. Full sync downloads headers backward from the tip looking for a block that is_canonical_sync recognizes. This check requires entries in CANONICAL_BLOCK_HASHES (number→hash). The background header download task stored headers via add_block_headers, which only wrote to HEADERS and BLOCK_NUMBERS but not CANONICAL_BLOCK_HASHES. This caused full sync to never find a canonical ancestor, looping endlessly with "Sync failed to find target block header" until the 8h timeout. Fix: add CANONICAL_BLOCK_HASHES write to the existing transaction in add_block_headers (which has exactly one caller: snap sync background download).

pablodeymo · 2026-02-10T18:49:28Z

Attaching the investigation of the bug:
BACKGROUND_HEADER_DOWNLOAD_INVESTIGATION.md

…ng bug The block_hashes field stored hashes without block numbers, so the numbers_and_hashes construction for forkchoice_update inferred block numbers from position (pivot_header.number - i). When background header download and update_pivot interleaved inserts, entries ended up out of order, causing forkchoice_update to write wrong canonical hashes. BTreeMap keyed by block number naturally handles ordering and deduplicates overlapping ranges from update_pivot re-inserting the same block numbers.

ElFantasma · 2026-02-09T23:35:25Z

    }

-    snap_sync(peers, &store, &mut block_sync_state, datadir).await?;
+    download_complete.store(true, Ordering::Release);


Bug: download_complete is only set to true in the happy path (here) and in two specific early-returns (max attempts, full-sync switch). But if the function returns via ? error propagation (e.g., peers.request_block_headers(...) fails, store.add_block_headers(...) fails, etc.), download_complete is never set to true.

When this happens, the sender half of the channel is dropped (task exits), but snap_sync()'s initial-headers wait loop (line ~435) checks download_status().is_terminal() which returns false (flag still false), so it keeps calling recv_headers_timeout() → gets None (closed channel) → loops with 100ms sleeps forever.

Suggestion: use an RAII guard pattern to ensure download_complete is always set on exit:

struct DownloadCompleteGuard(Arc<AtomicBool>); impl Drop for DownloadCompleteGuard { fn drop(&mut self) { self.0.store(true, Ordering::Release); } } // At the start of download_headers_background: let _guard = DownloadCompleteGuard(download_complete.clone());

Then remove the manual download_complete.store(true, ...) calls.

ElFantasma · 2026-02-09T23:35:25Z

+            );
+            // We can't easily go back in the background task, so we just continue with the current head
+            // The update_pivot mechanism will handle this case
+            tokio::time::sleep(Duration::from_millis(100)).await;


The original code handled the reorg case by walking backward: current_head = first_block_parent_hash. The background task can't do this because current_head was removed in favor of current_head_number, and going back requires knowing the parent hash.

But the current code just sleeps 100ms and retries with the same current_head_number, which will get the same headers back and loop indefinitely. The comment says "The update_pivot mechanism will handle this case" but update_pivot runs in snap_sync() on the main task — it doesn't feed back into the background task's current_head_number.

Consider either:

Decrement current_head_number by 1 to walk backward (simpler, analogous to the old parent_hash approach)

Add a maximum retry count for this specific case

Use _first_block_parent_hash (currently prefixed with _) to look up the parent's block number

ElFantasma · 2026-02-09T23:35:25Z

-                .process_incoming_headers(block_headers_iter)
-                .await?;
+        // Send headers through channel (skip the first as we already have it)
+        if block_headers.len() > 1 && header_sender.send(block_headers).await.is_err() {


The first header is skipped by the receiver in process_pending_headers() (via .skip(1)), but here the full batch including the first header is sent. This means every batch sent through the channel carries one header that will be discarded. Not a bug, but it's a subtle contract — the sender and receiver must agree on the "skip first" convention.

More importantly: the original code only skipped the first header of the first batch (to avoid the already-known current head). Here, .skip(1) is applied to every batch received from the channel. If the background task sends multiple batches, the first header of each batch (which is the continuation point from the previous batch's last header) gets silently dropped. This could cause gaps in block_hashes — missing one header per batch.

Double-check that the header fetching always returns overlapping ranges (i.e., the first header of batch N+1 == the last header of batch N). If so, the skip is correct. If not, this introduces gaps.

ElFantasma · 2026-02-09T23:35:25Z

+    block_sync_state.set_header_channel(header_receiver, download_complete.clone());
+
+    // Create Arc wrapper for snap_enabled so we can share it with the background task
+    let snap_enabled_arc = Arc::new(AtomicBool::new(snap_enabled.load(Ordering::Relaxed)));


nit: Creating a separate Arc<AtomicBool> copy of snap_enabled means the Syncer's original snap_enabled (shared with SyncManager) doesn't see the full-sync switch until snap_sync() completes and the sync-back on line ~389 runs. If SyncManager or other code checks snap_enabled concurrently during the sync cycle, it'll still see true even though the background task decided to switch to full sync.

In practice this is probably fine since the sync cycle is the only writer, but it's worth a comment explaining why a copy is used instead of sharing the original Arc.

Conflicts in peer_handler.rs and snap_sync.rs from main's permit-based peer-request refactor (#6523), observability (#6470), and pivot fix (#6475). Adopted main's permit pattern. Kept the new download_headers_background function and extended sync_cycle_snap/snap_sync signatures to accept the diagnostics handle from main alongside the snap_enabled flag from this PR. Test fixes: - snap_server_tests.rs: updated state_trie.hash() to take &NativeCrypto - added #![allow(clippy::unwrap_used)] for lazy_static block (test file)

Resolves conflicts in CHANGELOG.md and crates/networking/p2p/sync/snap_sync.rs, and addresses the unresolved review feedback in the same commit: - Completion guard: download_complete is now flipped via Drop, so every return path (including ? propagation) signals the main task. Resolves comment #6059 on download_headers_background error paths. - Background task no longer calls sync_cycle_full; it sets snap_enabled=false and returns. Syncer::sync_cycle picks up the full-sync transition on the next cycle. Eliminates the Store/peer concurrency between snap_sync (main) and full_sync (background). Resolves the Copilot review on line 297. - Headers are persisted to the store as each batch arrives, restoring reorg detection and checkpoint-resume semantics that the original sequential path provided. Resolves the Copilot review on line 236 and the ElFantasma review on line 247. - current_head is tracked as a hash locally in the background loop (mirrors the original walk-back-via-parent_hash behaviour for the reorg case). - The sender strips the duplicate first header from each batch; receivers no longer skip(1). Resolves the ElFantasma contract review on line 304. - Channel buffer extracted to HEADER_CHANNEL_BUFFER. Resolves the greptile buffer-size review. - Grammar fix in snap_server_tests.rs. Resolves the Copilot grammar nit.

ElFantasma · 2026-05-29T12:12:47Z

Review feedback addressed in `137bfd5`

Resolves conflicts with main and addresses the unresolved review comments. Summary by reviewer:

From @ElFantasma

✅ L315 — download_complete not set on ? paths: replaced manual flag-sets with an RAII CompletionGuard that flips the flag on Drop. Every return path (happy, ?, panic unwind) now signals completion.
✅ L247 — reorg detection broken: current_head is now tracked as a hash locally in the background loop, mirroring main's original walk-back-via-first_block_parent_hash recovery. No longer depends on a store lookup that fails once we fetch past the local canonical head.
✅ L304 — first-header skip contract mismatch: the sender now strips the duplicate first header from each batch via split_off(1); receivers no longer call .skip(1). Single source of truth for the channel contract.
⚠️ L355 — Arc<AtomicBool> copy of snap_enabled nit: not addressed. Tying Syncer::snap_enabled and the background task to a single shared Arc would change sync_cycle_snap's signature into the caller (Syncer::sync_cycle); happy to do this in a follow-up if you want.

From @Copilot

✅ L297 — full sync running concurrently in background task: the background task no longer calls sync_cycle_full. It sets snap_enabled = false + signals completion and returns. Syncer::sync_cycle runs full sync on the next cycle. Eliminates Store/peer concurrency between snap and full.
✅ L236 — reorg detection vs. unpersisted headers: now persists each batch to the store (add_block_headers + set_header_download_checkpoint) before sending via channel. Local current_head_hash covers the case where the store read can't be trusted (e.g. canonical-head divergence).
✅ L458 — mode-switch ambiguity: with the previous fix, the background task no longer returns through a path that could be confused with normal completion. snap_sync() aborting on snap_enabled=false is the single signal, and sync_cycle_snap cleans up + returns; the caller routes to full sync on the next cycle. No event mode confusion.
✅ snap_server_tests.rs:294: grammar fix applied.

From @greptile-apps

✅ L348 — channel buffer 100 hardcoded: extracted to HEADER_CHANNEL_BUFFER constant with documenting comment.
✅ L93 — is_download_complete() semantics: superseded by the DownloadStatus enum (NotStarted / InProgress / Complete) which makes the "no download started" case explicit. is_terminal() documents that both terminal states satisfy "no more headers expected."
✅ L471 — busy-wait loop: superseded by recv_headers_timeout() which uses tokio::time::timeout(receiver.recv()) — proper async wait, not a 100ms-sleep loop.
ℹ️ L305 — silent send error: left as-is per the prior reply (would be too spammy).

Code is cargo check / cargo fmt / cargo clippy clean.

pablodeymo added 17 commits January 22, 2026 13:19

docs: add snap sync refactoring plan

9f9214f

Document the phased approach for reorganizing snap sync code: - Phase 1: rlpx/snap module split - Phase 2: snap module split with server extraction - Phase 3: healing module unification

Merge branch 'main' into refactor/snapsync-healing-unification

57ebf37

fix(l1): use consistent usize type for missing_children_count

f680777

Change missing_children_count from u64 to usize in HealingQueueEntry and node_missing_children function to match StorageHealingQueueEntry and be consistent with memory structure counting conventions.

fix(l1): fix typos in healing module comments

17744dc

fix(l1): fix typo in snap client error message

c40de29

fix(l1): prevent panic on empty accounts vector in snap client

ffba3fe

fix(l1): handle empty bytecode hashes in request_bytecodes

a682d76

fix(l1): prevent panics from empty vector indexing in snap client

6c99453

fix(l1): prevent zero chunk_size in request_account_range

91b926e

Merge branch 'main' into refactor/snapsync-healing-unification

f3983c3

pablodeymo requested a review from a team as a code owner January 28, 2026 21:36

github-actions Bot assigned pablodeymo Jan 28, 2026

pablodeymo changed the title ~~Parallelize header download with state download during snap sync~~ perf(l1): parallelize header download with state download during snap sync Jan 28, 2026

github-actions Bot added L1 Ethereum client performance Block execution throughput and performance in general labels Jan 28, 2026

github-project-automation Bot added this to ethrex_performance and ethrex_l1 Jan 28, 2026

github-project-automation Bot moved this to Todo in ethrex_performance Jan 28, 2026

greptile-apps Bot reviewed Jan 28, 2026

View reviewed changes

pablodeymo added 2 commits January 28, 2026 19:12

Apply rustfmt formatting to snap sync files

883b17e

clippy fix

e15518b

pablodeymo added 2 commits February 5, 2026 16:07

Merge branch 'main' into refactor/snapsync-healing-unification

d01f055

pablodeymo added 7 commits February 5, 2026 16:44

Replace single-pattern match with if let in block bodies response

78c7b68

handling to fix clippy lint error

Propagate errors instead of panicking in request_storage_ranges and move

1414a3b

proof conversion helpers (encodable_to_proof, proof_to_encodable) from server.rs to mod.rs since they are shared utilities used by both client and server.

Merge remote-tracking branch 'origin/refactor/snapsync-healing-unific…

cca520c

…ation' into feature/background-header-download

Fix cargo fmt issue in peer_handler.rs introduced by merge

2d136d8

Merge branch 'feature/slack-phase-breakdown' into feature/background-…

c78cddc

…header-download

Base automatically changed from refactor/snapsync-healing-unification to main February 6, 2026 21:52

pablodeymo added 2 commits February 6, 2026 19:03

Merge branch 'main' into feature/background-header-download

9235125

Merge branch 'main' into feature/background-header-download

e69e21e

Copilot AI review requested due to automatic review settings February 9, 2026 14:50

Copilot started reviewing on behalf of pablodeymo February 9, 2026 14:51 View session

Copilot AI reviewed Feb 9, 2026

View reviewed changes

pablodeymo added 2 commits February 10, 2026 15:50

Merge branch 'main' into feature/background-header-download

0d3d3e6

ElFantasma requested changes Feb 11, 2026

View reviewed changes

github-project-automation Bot moved this to In Progress in ethrex_l1 Feb 11, 2026

ElFantasma and others added 5 commits May 4, 2026 14:54

Merge branch 'main' into feature/background-header-download

9bce394

Merge branch 'main' into feature/background-header-download

6b4e13f

Merge branch 'main' into feature/background-header-download

8c0b7a2

ElFantasma requested a review from a team as a code owner May 29, 2026 12:12

	// If the full 408 account state is needed check out previous commits the PR that added this code
	// If the full 408 account state is needed, check out previous commits of the PR that added this code

Conversation

pablodeymo commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Implementation

Test plan

Uh oh!

github-actions Bot commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Lines of code report

Uh oh!

greptile-apps Bot commented Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

pablodeymo commented Feb 5, 2026

Uh oh!

pablodeymo commented Feb 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

pablodeymo commented Feb 10, 2026

Uh oh!

ElFantasma Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma commented May 29, 2026

Review feedback addressed in 137bfd5

From @ElFantasma

From @Copilot

From @greptile-apps

Uh oh!

Reviewers

Assignees

Labels

Projects

pablodeymo commented Jan 28, 2026 •

edited

Loading

github-actions Bot commented Jan 28, 2026 •

edited

Loading

greptile-apps Bot commented Jan 28, 2026 •

edited

Loading

Review feedback addressed in `137bfd5`